Pediatric autoimmune neuropsychiatric disorder associated with group A streptococci (PANDAS) is a term used to describe a subset of children whose symptoms of obsessive compulsive disorder (OCD) or tic disorders are exacerbated by group A streptococcal (GAS) infection... but that is not the pandas we are after here.
The pandas library provides high-performance, easy-to-use data structures and data analysis tools.
pandas is widely used in the data science community, and its data structures are not restricted to strictly numerical data types. Among its main features:
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
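As a small taste of that split-apply-combine engine, here is a minimal sketch on made-up data (the station/value columns are hypothetical, invented for illustration):

import pandas as pd

# split by station, apply the mean to each group, combine the results
df = pd.DataFrame({'station': ['A', 'A', 'B', 'B'],
                   'value': [1.0, 2.0, 3.0, 4.0]})
print(df.groupby('station').mean())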
pandas does not implement significant modeling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first-class statistical modeling environment, but we are well on our way toward that goal.
pandas is under continuous development, and improvements are added on a weekly basis. As with many other open source projects, you also learn a lot by just keeping up to date on the development (as a minimum, read the release notes of every new release).
In [1]:
import numpy as np
import pandas as pd
import csv
from matplotlib import pyplot as plt
from datetime import datetime
Let's load the CSV GPS data that was introduced before.
In [2]:
fname = 'gps.csv'
# quoting=2 corresponds to csv.QUOTE_NONNUMERIC
df = pd.read_csv(fname, header=0, skiprows=[1], quoting=2)
%timeit pd.read_csv(fname, header=0, skiprows=[1], quoting=2)
Although it is not relevant here since there are no quotes in the data set, we can also use the CSV dialect option
In [3]:
dia = csv.excel()
dia.quoting = csv.QUOTE_NONE
df = pd.read_csv(fname, header=0, skiprows=[1], dialect=dia)
Yes, very nice, but what does pandas actually understand about the data types in this case?
In [4]:
df.dtypes
Out[4]:
Time is an object. I guess that's nice and general... Let's see if we can force it into a datetime object instead.
In [5]:
df = pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0])
df.dtypes
Out[5]:
Automatic date parsing is really nice and easy, but it is SLOOOOWWWW. How slow exactly?
In [6]:
%timeit pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0])
Can we do that any faster? How about manually defining the date-parsing function?
In [7]:
date = '20131211 18:29:30'
print(date[:4], date[4:6], date[6:8], date[9:11], date[12:14], date[15:17])
print(datetime(int(date[:4]), int(date[4:6]), int(date[6:8]), int(date[9:11]), int(date[12:14]), int(date[15:17])))
print(datetime.strptime(date, "%Y%m%d %H:%M:%S"))
The read_csv function has a handy date_parser argument that allows users to define their own date-parsing function. The default one is:
In [8]:
import dateutil.parser as parser

def date_parser_default(date):
    # let dateutil guess the format of each individual date string
    return parser.parse(date)
Because the default parser tries to actually understand the date format, it is very slow. If we specify exactly how to read the date field, it can go a lot faster. As a bonus, we could include any other correction or transformation of the date.
In [9]:
from datetime import datetime
def date_parser(date):
    # create a datetime instance and assume constant formatting
    # format: 20131211 18:29:30
    return datetime(int(date[:4]), int(date[4:6]), int(date[6:8]),
                    int(date[9:11]), int(date[12:14]), int(date[15:17]))

df = pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)
In [10]:
%timeit pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser)
Looks a bit ugly, isn't the next one more Pythonic?
In [11]:
def date_parser_2(date):
    # same fixed format, but let strptime do the work
    return datetime.strptime(date, "%Y%m%d %H:%M:%S")

%timeit pd.read_csv(fname, header=0, skiprows=[1], parse_dates=[0], date_parser=date_parser_2)
More compact, but unfortunately not as fast.
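A vectorized alternative, for what it's worth: read Time as a plain string column first and convert it afterwards with pd.to_datetime and an explicit format string. This avoids the per-row Python function call and is usually at least as fast (a sketch, reusing the same fname):

# read Time as plain strings, then convert the whole column in one vectorized call
df = pd.read_csv(fname, header=0, skiprows=[1])
df['Time'] = pd.to_datetime(df['Time'], format='%Y%m%d %H:%M:%S')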
In [12]:
df.dtypes
Out[12]:
In [13]:
print(df[:10])
However, there is also some built-in IPython notebook magic that goes like this:
In [14]:
df[:10]
Out[14]:
In [15]:
# pd.options.display.mpl_style was removed from later pandas versions;
# a matplotlib style sheet gives a similar look
plt.style.use('ggplot')
ax = df.plot()
In [16]:
plt.rcParams['figure.figsize'] = 12, 8
ax = df.plot(subplots=True, x='Time')
In [17]:
ax = df.plot(x='Latitude N', y='Longitude E', kind='scatter')
Latitude and Longitude could only have jumped around like this after using beam.me.up() from the Scotty package, which we didn't use here. Orientation should stay within a 0-360 degree range, and the sampling rate is not constant. Let's try to deal with these issues.
First, index the data using the Time column in order to use the more advanced selection, interpolation and group by functions provided by pandas.
Second, remove bad measurement samples by applying selection criteria to the data set.
Third, re-sample the data set.
Fourth, fill in the gaps by interpolating the data so we have a constant sample rate. See also the Working with missing data entry in the pandas manual.
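For reference, once each step is understood the whole pipeline can be chained into a single expression; the cells below walk through it step by step (a sketch, using the hourly rate and the latitude threshold from those cells):

# index by time, drop bad samples, resample to an hourly rate, interpolate the gaps
df_clean = (df[df['Latitude N'] > -1.0]
            .set_index('Time')
            .resample('H')
            .mean()
            .interpolate())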
In [18]:
df_indexed = df.set_index('Time')
print(df_indexed[:3])
print(df[:3])
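With the Time column as index, label-based selection comes for free; for example, partial string indexing on the datetime index selects whole periods at once (a sketch; the exact dates assume the December 2013 samples seen earlier, and slicing assumes the index is sorted in time):

# all samples from one day, via partial string indexing on the DatetimeIndex
print(df_indexed.loc['2013-12-11'][:3])
# or a label-based slice between two timestamps
print(df_indexed.loc['2013-12-11 18:00':'2013-12-11 19:00'])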
In [19]:
# remove bad samples: latitudes below -1.0 are clearly spurious
df_sel = df[df['Latitude N'] > -1.0]
# date-based selection works directly on the parsed Time column
df_sel2 = df[df.Time > datetime(2014, 1, 1)]
# equivalent selections can also be written with query(), where() or mask(), e.g.:
# df.query("orientation > 360.0")
# df.mask(df['Latitude N'] < 0.0)
In [20]:
ax = df_sel.plot(subplots=True, x='Time')
In [21]:
ax = df_sel.plot(x='Latitude N', y='Longitude E', kind='scatter')
See the API reference for more details on resample()
In [22]:
# re-sample the orientation to an hourly rate; .mean() aggregates each hourly bin
df_resample = df_sel.set_index('Time').resample('H').mean()
# let's have a look at the first and last rows
print(df_resample[0:5])
print(df_resample[-5:])
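Hourly bins without any samples now show up as NaN rows; this is where the fourth step, interpolation, comes in (a minimal sketch using linear interpolation; the Working with missing data entry in the pandas manual lists other strategies):

# fill the gaps introduced by resampling with linear interpolation
df_filled = df_resample.interpolate(method='linear')
print(df_filled[0:5])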
Grouping the data per time interval.
In [23]:
df_indexed = df_sel.set_index('Time')
# or use x.second, x.minute, x.hour, x.day, x.week, x.month, x.year
df_grouped = df_indexed.groupby(lambda x: x.month)
for name, group in df_grouped:
    group.plot(x='Latitude N', y='Longitude E', kind='scatter')
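As an aside, the same monthly grouping can be written with pd.Grouper instead of a lambda; note that pd.Grouper(freq='M') bins by calendar month, which matches the lambda only while the data stays within a single year (a sketch):

# group into calendar-month bins on the DatetimeIndex
for name, group in df_indexed.groupby(pd.Grouper(freq='M')):
    print(name, len(group))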
In [24]:
df_grouped_date = df_indexed.groupby(lambda x: x.week)
for name, group in df_grouped_date:
    print(name, group.orientation.count())
    group.plot()
In [25]:
df_grouped = df_indexed.groupby(lambda x: x.month)
for name, group in df_grouped:
    print('%3i %6i %6.1f %6.1f' % (name, group.orientation.count(),
                                   group.orientation.min(), group.orientation.max()))
    ax = group.orientation.hist(alpha=0.6, label=str(name), bins=range(0, 361, 20))
ax.legend(loc='best')
ax.set_xlim([0, 360])
Out[25]:
Is the above histogram presenting a fair comparison between the three months? If not, can you use simple pandas methods to reduce the bias that one month may have over another?
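One simple option is to normalize each month's histogram so the bars show densities rather than raw counts; a month with more samples then no longer dominates the comparison. A sketch (density is passed through to matplotlib, which called this argument normed in older versions):

for name, group in df_grouped:
    # density=True normalizes each histogram, removing the sample-count bias
    ax = group.orientation.hist(alpha=0.6, label=str(name),
                                bins=range(0, 361, 20), density=True)
ax.legend(loc='best')
ax.set_xlim([0, 360])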